An efficient memory operations optimization technique for vector loops on Itanium 2 processors
Authors
Abstract
To keep up with a large degree of instruction level parallelism (ILP), the Itanium 2 cache systems use a complex organization scheme: load/store queues, banking and interleaving. In this paper, we study the impact of these cache systems on memory instruction scheduling. We demonstrate that, if no care is taken at compile time, the non-precise memory disambiguation mechanism and the banking structure cause severe performance loss, even for very simple regular codes. We also show that grouping the memory operations in a pseudo-vectorized way enables the compiler to generate more effective code for the Itanium 2 processor. The impact of this code optimization technique on register pressure is analyzed for various vectorization schemes.

Keywords: Performance Measurement, Cache Optimization, Memory Access Optimization, Bank Conflicts, Memory Address Disambiguation, Instruction Level Parallelism.
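The grouping idea in the abstract can be illustrated with a minimal sketch. The loop, the grouping factor `VEC`, and the function names below are hypothetical illustrations, not code from the paper: instead of alternating loads from two arrays each iteration (which can pair same-cycle accesses that collide in a cache bank), a strip-mined loop issues a group of loads from one array, then a group from the other, in vector style.

```c
#define VEC 4  /* assumed grouping factor; not taken from the paper */

/* Naive daxpy-like loop: a load from x and a load from y are issued
   close together every iteration, so the pair may fall into the same
   cache bank in the same cycle. */
void daxpy_naive(double *y, const double *x, double a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}

/* Pseudo-vectorized grouping: issue VEC consecutive loads from x,
   then VEC consecutive loads from y. Consecutive elements of one
   array map to distinct banks, so each group is conflict-free. */
void daxpy_grouped(double *y, const double *x, double a, int n) {
    int i;
    for (i = 0; i + VEC <= n; i += VEC) {
        double xv[VEC], yv[VEC];
        for (int k = 0; k < VEC; k++) xv[k] = x[i + k];  /* load group from x */
        for (int k = 0; k < VEC; k++) yv[k] = y[i + k];  /* load group from y */
        for (int k = 0; k < VEC; k++) y[i + k] = yv[k] + a * xv[k];  /* stores */
    }
    for (; i < n; i++)  /* scalar remainder */
        y[i] = y[i] + a * x[i];
}
```

The trade-off the abstract mentions is visible here: the grouped version keeps `2 * VEC` values live at once, so larger grouping factors raise register pressure, which is why the paper analyzes several vectorization schemes.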
Similar articles
Performance of OSCAR Multigrain Parallelizing Compiler on SMP Servers
This paper describes performance of OSCAR multigrain parallelizing compiler on various SMP servers, such as IBM pSeries 690, Sun Fire V880, Sun Ultra 80, NEC TX7/i6010 and SGI Altix 3700. The OSCAR compiler hierarchically exploits the coarse grain task parallelism among loops, subroutines and basic blocks and the near fine grain parallelism among statements inside a basic block in addition to t...
Development of High Performance Software Distributed Shared Memory System for Vector Processing
Parallel implementation of basic linear algebra operations for sparse matrix algorithms is a critical problem on shared memory architectures with finite memory bandwidth. We discuss the parallelizing methodology of vector processing and evaluate its performance on some commercially available shared memory systems. From the results of the evaluation, we hypothesize the most critical issue in buil...
Efficient Exploitation of Hyper Loop Parallelism in Vectorization
Modern processors can provide large amounts of processing power with vector SIMD units if the compiler or programmer can vectorize their code. With the advance of SIMD support in commodity processors, more and more advanced features are introduced, such as flexible SIMD lane-wise operations (e.g. blend instructions). However, existing vectorizing techniques fail to apply global SIMD lane-wise o...
Performance comparison of data-reordering algorithms for sparse matrix-vector multiplication in edge-based unstructured grid computations
Several performance improvements for finite-element edge-based sparse matrix–vector multiplication algorithms on unstructured grids are presented and tested. Edge data structures for tetrahedral meshes and triangular interface elements are treated, focusing on nodal and edges renumbering strategies for improving processor and memory hierarchy use. Benchmark computations on Intel Itanium 2 and P...
Weld for Itanium Processor
Sharma, Saurabh Weld for Itanium Processor (Under the direction of Dr. Thomas M. Conte) This dissertation extends a WELD for Itanium processors. Emre Özer presented WELD architecture in his Ph.D. thesis. WELD integrates multithreading support into an Itanium processor to hide run-time latency effects that cannot be determined by the compiler. Also, it proposes a hardware technique called operat...
Journal:
- Concurrency and Computation: Practice and Experience
Volume 18, Issue -
Pages: -
Published: 2006